Appendix D — Joins

If you work with biodiversity data, odds are high that you will need to join 2 separate datasets at some point to test how spatial, temporal or environmental factors influence species. This chapter is a brief overview of several common types of joins dplyr to help get started. For a comprehensive introduction to joins, check out the Joins chapter in R for Data Science.

D.0.1 Prerequisites

In this chapter, we will use pardalote occurrence data from 2015 in the ALA.

library(galah)
library(dplyr)
library(here)
library(ggplot2)

galah_config(email = "your-email-here")

pardalotes <- galah_call() |>
  identify("Pardalotus") |>
  filter(year == 2015) |>
  select(genus, 
         species, 
         scientificName, 
         cl22,
         year,
         month, 
         decimalLatitude,
         decimalLongitude) |> 
  atlas_occurrences()

pardalotes_taxonomy <- galah_call() |>
  identify("Pardalotus") |>
  atlas_species()

D.1 Keys

Joining dataframes relies on setting a key—one or more columns that exist in a primary table that correspond to one or more columns in a second table. Two dataframes that we intend to join are matched according to the designated key.

As a basic example, let’s say we want to add complete taxonomic information to our pardalotes dataframe, which contains occurrence records with some, but not all, taxonomic information. pardalotes_taxonomy contains complete taxonomic information for Pardalotus.

Let’s join our pardalotes dataframe with pardalotes_taxonomy. The column scientificName in pardalotes appears to contain (in general) the same information in column species_name in pardalotes_taxonomy.

We can use these columns as a key to join the additional taxonomic information to our pardalotes occurrence records. We can also use the genus column as a second key to match any observations that have been identified down to genus, but not species, level.

pardalotes |>
  left_join(pardalotes_taxonomy, 
            join_by(scientificName == species_name, 
                    genus == genus)
            ) |>
  
  rmarkdown::paged_table() # paged output

D.2 Basic types of joins

There are many types of joins that can help you in all kinds of situations! Join types generally fall within two categories:

  • Mutating joins combine variables from two tables (e.g., left_join(), right_join(), full_join())
  • Filtering joins combine variables, and additionally keep or remove rows that do not match the key column (e.g., semi_join(), anti_join())

Below are a few common examples of join types. Examples and animations are taken from Garrick Aden-Buie’s tidyexplain animations.

x <- tibble(id = c(1, 2, 3),
            x = c("x1", "x2", "x3"))

y <- tibble(id = c(1, 2, 4),
            y = c("y1", "y2", "y4"))

D.3 Spatial joins

It can be useful to summarise where species or taxonomic groups occur by grouping by a spatial region (e.g., state, council area, bioregion). To do so, records or summary statistics need to be linked to the region they pertain to. Usually, this requires a join between a spatial object and a dataframe, or between two spatial objects.

As an simple example, let’s download a shapefile of Australian states and territories using the ozmaps package. Our aus object contains the name of each state/territory (NAME) and its shape (geometry)1.

library(sf)
library(ozmaps)

aus <- ozmap_states |>
  st_transform(4326)

aus
1
This line transforms the CRS projection of our map to match the CRS projection of ALA data.
Simple feature collection with 9 features and 1 field
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 105.5507 ymin: -43.63203 xmax: 167.9969 ymax: -9.229287
Geodetic CRS:  WGS 84
# A tibble: 9 × 2
  NAME                                                                  geometry
* <chr>                                                       <MULTIPOLYGON [°]>
1 New South Wales              (((150.7016 -35.12286, 150.6611 -35.11782, 150.6…
2 Victoria                     (((146.6196 -38.70196, 146.6721 -38.70259, 146.6…
3 Queensland                   (((148.8473 -20.3457, 148.8722 -20.37575, 148.85…
4 South Australia              (((137.3481 -34.48242, 137.3749 -34.46885, 137.3…
5 Western Australia            (((126.3868 -14.01168, 126.3625 -13.98264, 126.3…
6 Tasmania                     (((147.8397 -40.29844, 147.8902 -40.30258, 147.8…
7 Northern Territory           (((136.3669 -13.84237, 136.3339 -13.83922, 136.3…
8 Australian Capital Territory (((149.2317 -35.222, 149.2346 -35.24047, 149.271…
9 Other Territories            (((167.9333 -29.05421, 167.9188 -29.0344, 167.93…

Our pardalotes data also contains the state/territory of each occurrence in column cl22. We can group by state/territory and summarise the number of occurrences to get an overall count by state/territory.

counts_by_state <- pardalotes |>
  group_by(cl22) |>
  count()

counts_by_state
# A tibble: 9 × 2
# Groups:   cl22 [9]
  cl22                             n
  <chr>                        <int>
1 Australian Capital Territory  3982
2 New South Wales               7652
3 Northern Territory             549
4 Queensland                    6687
5 South Australia               1896
6 Tasmania                      1561
7 Victoria                     11123
8 Western Australia             2522
9 <NA>                           160

To prepare our data for mapping, we can join counts_by_state to aus using the state/territory name as our key.

aus_counts <- aus |>
  left_join(counts_by_state,
            join_by(NAME == cl22))

aus_counts
Simple feature collection with 9 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 105.5507 ymin: -43.63203 xmax: 167.9969 ymax: -9.229287
Geodetic CRS:  WGS 84
# A tibble: 9 × 3
  NAME                                                            geometry     n
  <chr>                                                 <MULTIPOLYGON [°]> <int>
1 New South Wales              (((150.7016 -35.12286, 150.6611 -35.11782,…  7652
2 Victoria                     (((146.6196 -38.70196, 146.6721 -38.70259,… 11123
3 Queensland                   (((148.8473 -20.3457, 148.8722 -20.37575, …  6687
4 South Australia              (((137.3481 -34.48242, 137.3749 -34.46885,…  1896
5 Western Australia            (((126.3868 -14.01168, 126.3625 -13.98264,…  2522
6 Tasmania                     (((147.8397 -40.29844, 147.8902 -40.30258,…  1561
7 Northern Territory           (((136.3669 -13.84237, 136.3339 -13.83922,…   549
8 Australian Capital Territory (((149.2317 -35.222, 149.2346 -35.24047, 1…  3982
9 Other Territories            (((167.9333 -29.05421, 167.9188 -29.0344, …    NA

Now we can use this data to create a choropleth map2.

ggplot() +
  geom_sf(data = aus_counts,
          aes(fill = n)) + 
  guides(fill = guide_coloursteps(title = "Number of\nObservations")) +
  scale_fill_viridis_c(option = "G") +
  theme_void()

The sf package also has specialised functions for spatial joins like st_join(), which can be especially useful for joins using points (e.g., POINT) and shapes (e.g., POLYGON, MULTIPOLYGON). Below is a small example where we use the point location to join with the state/territory. Note that we lose the POINT location in favour of the state MULTIPOLYGON shape, held in the column geometry.

# convert record coordinates to sf POINT class
pardalotes_sf <- pardalotes |>
  tidyr::drop_na() |>
  st_as_sf(coords = c("decimalLongitude", "decimalLatitude"), 
           crs = 4326)

# join points to aus states that intersect spatially
states_with_species <- st_join(x = aus, 
                               y = pardalotes_sf,
                            join = st_intersects,
                            left = FALSE)

states_with_species |>
  rmarkdown::paged_table() # paged output

  1. This shapefile contains state/territory outlines as vectors. See this section on vectors to learn more about what a vector is.↩︎

  2. For more advanced examples of making choropleth maps, check out the ALA Labs articles counting points in shapefiles and using multiple colour scales.↩︎